We use Eqs. 4.30 – 4.31 to jointly learn the DCP-NAS and rewrite the objective function
in Eq. 4.26 as
$$
L_{\text{DCP-NAS}}\big(\tilde{f}^{P}(w,\alpha),\, \tilde{f}^{C}_{b}(\hat{w},\hat{\alpha},\beta)\big)
= -G(\hat{w},\hat{\alpha},\beta) + \lambda D(\hat{\alpha}) + \mu L_{R}(\hat{w},\beta).
\tag{4.32}
$$
Then we optimize the binary architecture $\hat{\alpha}$, which is inherited from the real-valued model, along the tangent direction of that real-valued model. Note that when $\lambda = 0$, Eq. 4.32 is equivalent to the objective of the original CP-NAS [304]. As revealed in [157], real-valued weights converge faster than binarized ones. Motivated by this observation, the tangent direction of the Parent supernet can be used to approximate the optimization direction of the more slowly converging Child supernet. In summary, in Eq. 4.31 we improve the optimization of the Child architecture using the tangent direction of the Parent architecture, which allows the Child supernet to be trained more efficiently.
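To make the objective concrete, below is a minimal PyTorch-style sketch of Eq. 4.32 in which the discrepancy term $D(\hat{\alpha})$ is taken to be the squared distance between the Child and Parent tangent directions, consistent with the gradient derived in Eq. 4.33 below. The function names, tensor shapes, and hyperparameter values are illustrative assumptions, not the original implementation.

```python
import torch

def tangent_discrepancy(G_child, f_parent, alpha_hat, alpha):
    # D(alpha_hat): squared distance between the Child tangent direction
    # dG/d(alpha_hat) and the Parent tangent direction df^P/d(alpha).
    # create_graph=True keeps the graph of g_child, so backpropagating
    # through D later yields the Hessian term appearing in Eq. 4.33.
    g_child = torch.autograd.grad(G_child, alpha_hat, create_graph=True)[0]
    g_parent = torch.autograd.grad(f_parent, alpha, retain_graph=True)[0]
    return ((g_child - g_parent.detach()) ** 2).sum()

def dcp_nas_loss(G_child, f_parent, recon_loss, alpha_hat, alpha,
                 lam=1.0, mu=1.0):
    # Eq. 4.32: L = -G(w_hat, alpha_hat, beta) + lambda * D(alpha_hat)
    #               + mu * L_R(w_hat, beta); lam and mu are placeholders.
    D = tangent_discrepancy(G_child, f_parent, alpha_hat, alpha)
    return -G_child + lam * D + mu * recon_loss
```

Here `G_child` and `f_parent` stand for scalar losses already computed from the Child and Parent supernets, and `recon_loss` stands for $L_{R}(\hat{w},\beta)$.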
Considering that the binary weights are learned via the KL divergence, we optimize our DCP-NAS as
$$
\begin{aligned}
\nabla_{\hat{\alpha}} L_{\text{DCP-NAS}}\big(\tilde{f}^{P}(w,\alpha),\, \tilde{f}^{C}_{b}(\hat{w},\hat{\alpha},\beta)\big)
&= -\frac{\partial G(\hat{w},\hat{\alpha},\beta)}{\partial \hat{\alpha}} + \lambda\,\frac{\partial D(\hat{\alpha})}{\partial \hat{\alpha}} \\
&= -\frac{\partial G(\hat{w},\hat{\alpha},\beta)}{\partial \hat{\alpha}}
+ 2\lambda \left( \frac{\partial G(\hat{w},\hat{\alpha},\beta)}{\partial \hat{\alpha}} - \frac{\partial \tilde{f}^{P}(w,\alpha)}{\partial \alpha} \right) \frac{\partial^{2} G(\hat{w},\hat{\alpha},\beta)}{\partial \hat{\alpha}^{2}} \\
&= -\frac{\partial G(\hat{w},\hat{\alpha},\beta)}{\partial \hat{\alpha}}
+ 2\lambda \left( \frac{\partial G(\hat{w},\hat{\alpha},\beta)}{\partial \hat{\alpha}} - \frac{\partial \tilde{f}^{P}(w,\alpha)}{\partial \alpha} \right) H_{G}(\hat{\alpha}),
\end{aligned}
\tag{4.33}
$$
$$
\nabla_{\hat{w}} L_{\text{DCP-NAS}}\big(\tilde{f}^{P}(w,\alpha),\, \tilde{f}^{C}_{b}(\hat{w},\hat{\alpha},\beta)\big)
= -\frac{\partial G(\hat{w},\hat{\alpha},\beta)}{\partial b_{\hat{w}}}\,\frac{\partial b_{\hat{w}}}{\partial \hat{w}},
\tag{4.34}
$$
where
$$
\frac{\partial b_{\hat{w}}}{\partial \hat{w}} = \mathbf{1}_{|\hat{w}| \le 1},
\tag{4.35}
$$
and $\lambda$ is a hyperparameter; $H_{G}(\hat{\alpha}) = \frac{\partial^{2} G(\hat{w},\hat{\alpha},\beta)}{\partial \hat{\alpha}^{2}}$ denotes the Hessian matrix. The DCP-NAS process is outlined in Fig. 4.12.
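Eq. 4.35 is the clipped straight-through estimator commonly used for sign binarization: the gradient passes through the binarization only where $|\hat{w}| \le 1$. A minimal PyTorch sketch is shown below; the class name is ours, not taken from the original code.

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """Sign binarization b_w = sign(w_hat) with the clipped straight-through
    estimator of Eq. 4.35: db_w/dw_hat = 1 where |w_hat| <= 1, else 0."""

    @staticmethod
    def forward(ctx, w_hat):
        ctx.save_for_backward(w_hat)
        return torch.sign(w_hat)

    @staticmethod
    def backward(ctx, grad_output):
        (w_hat,) = ctx.saved_tensors
        # Pass gradients through only inside the clipping region |w_hat| <= 1.
        return grad_output * (w_hat.abs() <= 1).to(grad_output.dtype)
```

Calling `BinarizeSTE.apply(w_hat)` in the forward pass produces $b_{\hat{w}}$, while the backward pass realizes the chain rule of Eq. 4.34.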
We minimize the difference between the Child tangent direction $\frac{\partial G(\hat{w},\hat{\alpha},\beta)}{\partial \hat{\alpha}}$ and the Parent tangent direction $\frac{\partial \tilde{f}^{P}(w,\alpha)}{\partial \alpha}$, so that the real-valued NAS and the binary NAS search architectures in the same direction and thus generate a better 1-bit architecture. Note that $\alpha$ inherits from $\hat{\alpha}$ at the beginning of each real-valued NAS iteration, meaning that the $w$ and $\alpha$ of the real-valued NAS are used only to provide a heuristic optimization direction for the 1-bit NAS, not to search for a better architecture for real-valued networks. Since a better tangent direction $\frac{\partial \tilde{f}^{P}(w,\alpha)}{\partial \alpha}$ is obtained, DCP-NAS can find a more suitable $\hat{\alpha}$ for binary networks. Note also that $\hat{\alpha}$ differs from $\alpha$: it is an architecture optimized for 1-bit weights rather than for real-valued weights.
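The following sketch spells out how the architecture gradient of Eq. 4.33 can be assembled: the correction term is evaluated as a Hessian-vector product through a second backward pass rather than by forming $H_{G}(\hat{\alpha})$ explicitly. It assumes scalar losses and a flattened architecture tensor, and the names are illustrative; even this double backpropagation is costly for a large supernet, which motivates the approximation introduced next.

```python
import torch

def arch_gradient(G_child, f_parent, alpha_hat, alpha, lam=1.0):
    # Child and Parent tangent directions (create_graph=True enables the
    # second backward pass needed for the Hessian-vector product below).
    g_child = torch.autograd.grad(G_child, alpha_hat, create_graph=True)[0]
    g_parent = torch.autograd.grad(f_parent, alpha)[0]
    diff = (g_child - g_parent).detach()
    # Hessian-vector product H_G(alpha_hat) @ diff via double backprop.
    hvp = torch.autograd.grad(g_child, alpha_hat, grad_outputs=diff)[0]
    # Eq. 4.33: -dG/d(alpha_hat)
    #           + 2*lam*(dG/d(alpha_hat) - df^P/d(alpha)) H_G(alpha_hat)
    return -g_child.detach() + 2.0 * lam * hvp
```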
Eq. 4.33 contains an expensive matrix gradient computation in its second term. We therefore introduce a first-order approximation of the Hessian matrix in Section 4.4.5 to make the search more efficient.
4.4.5 Generalized Gauss-Newton Matrix (GGN) for Hessian Matrix
Since the Hessian matrix is computationally expensive, this section accelerates its calculation by deriving a second-order expansion based on Eq. 4.33.
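For orientation, the standard generalized Gauss-Newton construction replaces the Hessian with $J^{\top} H_{\ell} J$, keeping only the curvature contributed by the loss and discarding the term that involves second derivatives of the network itself. The sketch below shows this generic construction, assuming a 1-D architecture tensor and a vector-valued network output; it is an illustration of the idea, not the exact form derived for DCP-NAS in the remainder of this section.

```python
import torch
from torch.autograd.functional import jacobian, hessian

def ggn_hessian(net_fn, loss_fn, alpha_hat):
    # Generic GGN approximation of d^2 loss(net(alpha)) / d alpha^2:
    #   H_GGN = J^T H_l J,
    # where J = d net / d alpha and H_l = d^2 loss / d output^2.
    # net_fn and loss_fn are hypothetical callables; alpha_hat and the
    # output of net_fn are assumed to be 1-D tensors.
    z = net_fn(alpha_hat).detach()        # network output, shape (m,)
    J = jacobian(net_fn, alpha_hat)       # Jacobian, shape (m, n)
    H_l = hessian(loss_fn, z)             # loss Hessian, shape (m, m)
    return J.T @ H_l @ J                  # GGN matrix, shape (n, n)
```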